Podcast Reviews Data Analysis
The dataset for this project is iTunes Podcast Reviews, sourced from scraped iTunes podcast review RSS feeds. It contains information spanning 2019 to 2023 across the USA, offering valuable time-series data for a specific market.
Business Stakeholders and Objectives

Stakeholders:
- Podcast creators / Podcast sponsors: Interested in understanding audience preferences and improving their content based on feedback.
- Marketing teams: Aim to identify effective marketing strategies and optimize promotional efforts.
- Data analysts: Responsible for extracting insights from the data to inform decision-making.

Objectives:
- Podcast creators: To identify popular podcast genres/topics and areas for content improvement through analysis of listener engagement and feedback.
- Marketing teams: To analyze trends in podcast listenership and sentiment to inform marketing campaigns and target audience outreach.
- Data analysts: To conduct thorough exploratory analysis to extract actionable insights from the podcast reviews dataset.
# %pip install python-dotenv pandas kaggle numpy textblob nbformat scipy scikit-learn statsmodels sqlalchemy plotly
# Terminal commands:
# conda activate rapids-24.02
# conda install -c conda-forge python-dotenv pandas numpy textblob nbformat scipy scikit-learn statsmodels sqlalchemy plotly
# Install RAPIDS (WSL PowerShell):
# conda create --solver=libmamba -n rapids-24.02 -c rapidsai -c conda-forge -c nvidia \
# rapids=24.02 python=3.10 cuda-version=12.0 \
# jupyterlab dash
import os
from dotenv import load_dotenv
load_dotenv()
from numba import cuda
cuda.detect()
from sqlalchemy import create_engine
import sqlite3
from math import sqrt
%load_ext cudf.pandas
import pandas as pd
import numpy as np
import cudf
from concurrent.futures import ThreadPoolExecutor
from textblob import TextBlob
from collections import Counter
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
nltk.download("vader_lexicon")
nltk.download("punkt")
from scipy.stats import (
pointbiserialr,
f_oneway,
chi2_contingency,
norm,
spearmanr,
t,
pearsonr,
ttest_ind,
)
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp
pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
Found 1 CUDA devices
id 0 b'NVIDIA GeForce RTX 2060' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 1
UUID: GPU-d5fffe5d-eda6-e044-a96d-7e29d7648f51
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
Summary:
1/1 devices are supported
[nltk_data] Downloading package vader_lexicon to /home/cannelle/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/cannelle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
from utils.functions import (
print_missing_and_duplicates,
map_to_general_category,
analyze_sentiment,
)
# kaggle_json_filename = "kaggle.json"
# notebook_directory = os.getcwd()
# kaggle_json_path = os.path.join(notebook_directory, kaggle_json_filename)
# if os.path.exists(kaggle_json_path):
# os.environ['KAGGLE_CONFIG_DIR'] = notebook_directory
# import kaggle
# else:
# print("Error: kaggle.json file not found in the project root directory.")
# kaggle.api.authenticate()
# kaggle.api.dataset_download_files(dataset="thoughtvector/podcastreviews", path="./datasets", unzip=True)
# download_path = "./datasets"
# old_file_path = os.path.join(download_path, "database.db")
# new_file_path = os.path.join(download_path, "database.sqlite")
# if os.path.exists(old_file_path):
# os.rename(old_file_path, new_file_path)
cnx = sqlite3.connect("./datasets/database.sqlite")
df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)
print(df)
         name
0        runs
1    podcasts
2  categories
3     reviews
categories = pd.read_sql_query("SELECT * FROM Categories", cnx)
podcasts = pd.read_sql_query("SELECT * FROM Podcasts", cnx)
reviews = pd.read_sql_query("SELECT * FROM Reviews", cnx)
runs = pd.read_sql_query("SELECT * FROM Runs", cnx)
display(categories.head(2))
display(podcasts.head(2))
display(reviews.head(2))
display(runs.head(2))
|   | podcast_id | category |
|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts-performing-arts |

|   | podcast_id | itunes_id | slug | itunes_url | title |
|---|---|---|---|---|---|
| 0 | a00018b54eb342567c94dacfb2a3e504 | 1313466221 | scaling-global | https://podcasts.apple.com/us/podcast/scaling-... | Scaling Global |
| 1 | a00043d34e734b09246d17dc5d56f63c | 158973461 | cornerstone-baptist-church-of-orlando | https://podcasts.apple.com/us/podcast/cornerst... | Cornerstone Baptist Church of Orlando |

|   | podcast_id | title | content | rating | author_id | created_at |
|---|---|---|---|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | really interesting! | Thanks for providing these insights. Really e... | 5 | F7E5A318989779D | 2018-04-24T12:05:16-07:00 |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | Must listen for anyone interested in the arts!!! | Super excited to see this podcast grow. So man... | 5 | F6BF5472689BD12 | 2018-05-09T18:14:32-07:00 |

|   | run_at | max_rowid | reviews_added |
|---|---|---|---|
| 0 | 2021-05-10 02:53:00 | 3266481 | 1215223 |
| 1 | 2021-06-06 21:34:36 | 3300773 | 13139 |
There are 4 tables:
- Categories - Categories data [podcast_id, category]
- Podcasts - Podcasts data [podcast_id, itunes_id, slug, itunes_url, title]
- Reviews - Reviews data [podcast_id, title, content, rating, author_id, created_at]
- Runs - Runs data [run_at, max_rowid, reviews_added]
- Missing values can lead to inaccurate or misleading statistics and machine learning model predictions. They arise from data entry errors, failures to collect information, and similar causes; the appropriate handling strategy depends on their nature and extent.
- Duplicate rows, often introduced by data entry errors or dataset merges, can bias or distort analysis results, so it is important to identify and remove them.
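The checks below rely on print_missing_and_duplicates, imported from utils/functions.py, whose implementation is not shown in this notebook. A minimal sketch of what such a helper might look like (the function name matches the import; the body is an assumption):

```python
import pandas as pd

def print_missing_and_duplicates(df: pd.DataFrame, name: str) -> None:
    """Report total missing cells and fully duplicated rows for a table."""
    missing = int(df.isna().sum().sum())      # total count of missing cells
    duplicates = int(df.duplicated().sum())   # count of fully duplicated rows
    if missing:
        print(f"Missing values in {name} table: {missing}")
    if duplicates:
        print(f"Duplicates in {name} table: {duplicates}")

# Tiny demo: one missing cell, one duplicated row
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
print_missing_and_duplicates(demo, "Demo")
```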
print_missing_and_duplicates(categories, "Categories")
print_missing_and_duplicates(podcasts, "Podcasts")
print_missing_and_duplicates(reviews, "Reviews")
print_missing_and_duplicates(runs, "Runs")
Duplicates in Reviews table: 655
Drop Duplicate Rows:
reviews = reviews.drop_duplicates()
Chosen Strategy for Organizing Tables:
1. Merging Tables:
   - The three tables are merged on the podcast_id value.
   - The rows are sorted by this value.
   - Merging the tables was chosen to streamline the workflow and facilitate the use of the data for various comparisons.
merged_table = pd.merge(categories, podcasts, on="podcast_id", how="outer")
merged_table = pd.merge(merged_table, reviews, on="podcast_id", how="outer")
merged_table.sort_values(by=["podcast_id"], inplace=True)
merged_table = merged_table.reset_index(drop=True)
2. Preparing the data for display:
   - Mapping each category value to one of the general categories: Business & Finance, Religion & Spirituality, News & Politics, Sports & Recreation, Arts, Education, Society & Culture, TV & Film, Health & Fitness, Others, Music, True Crime, Comedy, History, Leisure, Kids & Family, Science, Fiction, Technology, Government.
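map_to_general_category is also imported from utils/functions.py and its implementation is not shown. A plausible sketch, assuming it collapses the fine-grained iTunes category slugs (e.g. arts-performing-arts) into the general labels by prefix; the mapping table here is illustrative, not the real one:

```python
# Illustrative prefix-to-label mapping; the real table in utils/functions.py
# would cover all 20 general categories listed above.
GENERAL_CATEGORIES = {
    "arts": "Arts",
    "business": "Business & Finance",
    "religion": "Religion & Spirituality",
    "news": "News & Politics",
    "society": "Society & Culture",
}

def map_to_general_category(category: str) -> str:
    """Map a raw slug such as 'arts-performing-arts' to a general label."""
    prefix = str(category).split("-")[0]          # keep the leading slug token
    return GENERAL_CATEGORIES.get(prefix, "Others")

print(map_to_general_category("arts-performing-arts"))  # Arts
print(map_to_general_category("some-unknown-slug"))     # Others
```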
processed_dataset = merged_table.copy()
processed_dataset["category"] = processed_dataset["category"].apply(
map_to_general_category
)
processed_dataset["podcast_title"] = (
processed_dataset["title_x"].fillna("")
+ " "
+ processed_dataset["title_y"].fillna("")
)
processed_dataset.head(2)
|   | podcast_id | category | itunes_id | slug | itunes_url | title_x | title_y | content | rating | author_id | created_at | podcast_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a00018b54eb342567c94dacfb2a3e504 | Business & Finance | 1.313466e+09 | scaling-global | https://podcasts.apple.com/us/podcast/scaling-... | Scaling Global | Very informative | Great variety of speakers! | 5 | CC47C85896D423B | 2017-11-29T12:16:43-07:00 | Scaling Global Very informative |
| 1 | a00043d34e734b09246d17dc5d56f63c | Religion & Spirituality | 1.589735e+08 | cornerstone-baptist-church-of-orlando | https://podcasts.apple.com/us/podcast/cornerst... | Cornerstone Baptist Church of Orlando | Good Sernons | I'm a regular listener. I only wish that the ... | 5 | 103CC9DA2046218 | 2019-10-08T04:23:32-07:00 | Cornerstone Baptist Church of Orlando Good Ser... |
OUTCOMES:
- No missing values were found in the datasets.
- 655 duplicates were found in the Reviews table and were subsequently dropped.
- The decision was made to reorganize the data into one table, as this could potentially facilitate analysis in further steps.
Preliminary Plan for Data Exploration
Basic Exploration
- Utilize the describe function to provide an overview of numerical and categorical features in each dataset.
- Check the distributions of podcasts over categories, ratings, and number of reviews over time.
Detailed Exploration
Data Sampling:
- Preparing a subset of data
Trend Characteristics Analysis:
- Understanding podcast listenership trends
- Identifying popular podcast genres/topics
- Analyzing sentiment of podcast reviews
Statistical Inference:
- Correlation between average ratings and voting counts
- Variances in rating averages across podcast categories
- Monthly variations in rating averages
print("Processed Dataset")
display(processed_dataset.describe(include=["object"]).T)
Processed Dataset
|   | count | unique | top | freq |
|---|---|---|---|---|
| podcast_id | 4552196 | 111544 | bf5bf76d5b6ffbf9a31bba4480383b7f | 33100 |
| category | 4552196 | 20 | Society & Culture | 661552 |
| slug | 4527973 | 108919 | crime-junkie | 33100 |
| itunes_url | 4527973 | 110024 | https://podcasts.apple.com/us/podcast/crime-ju... | 33100 |
| title_x | 4527973 | 109274 | Crime Junkie | 33100 |
| title_y | 4552196 | 1138688 | Great podcast | 30828 |
| content | 4552196 | 2049707 | I love this podcast! | 404 |
| author_id | 4552196 | 1475285 | D3307ADEFFA285C | 1660 |
| created_at | 4552196 | 2054352 | 2017-09-19T08:29:49-07:00 | 14 |
| podcast_title | 4552196 | 1868684 | Crime Junkie Obsessed | 466 |
unique_ratings = processed_dataset["rating"].unique()
count_ratings = len(unique_ratings)
top_rating = processed_dataset["rating"].mode().values[0]
top_rating_freq = processed_dataset["rating"].value_counts().max()
total_ratings = processed_dataset["rating"].count()
print("Unique Ratings:", unique_ratings)
print("Count of Unique Ratings:", count_ratings)
print("Most Common Rating (Top):", top_rating)
print("Frequency of Most Common Rating:", top_rating_freq)
print("Total Number of Ratings:", total_ratings)
Unique Ratings: [5 1 4 2 3]
Count of Unique Ratings: 5
Most Common Rating (Top): 5
Frequency of Most Common Rating: 3982850
Total Number of Ratings: 4552196
category_counts = processed_dataset["category"].value_counts()
fig = px.bar(
x=category_counts.index,
y=category_counts.values,
labels={"x": "Category", "y": "Count"},
)
fig.update_layout(
title="Podcast Counts by Category",
xaxis_title="Category",
yaxis_title="Podcast Count",
template="plotly_dark",
)
fig.show()
rating_counts = processed_dataset["rating"].value_counts()
fig = px.bar(
x=rating_counts.index, y=rating_counts.values, labels={"x": "Rating", "y": "Count"}
)
fig.update_layout(
title="Podcast Counts by Rating",
xaxis_title="Rating",
yaxis_title="Podcast Count",
template="plotly_dark",
)
fig.show()
runs["run_date"] = pd.to_datetime(runs["run_at"]).dt.date
reviews_added_per_day = runs.groupby("run_date")["reviews_added"].sum().reset_index()
fig = px.line(
reviews_added_per_day,
x="run_date",
y="reviews_added",
title="Reviews Added Over Time",
template="plotly_dark",
)
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Number of Reviews Added")
fig.show()
OUTCOMES:
- Podcast Counts by Category: The Society & Culture category has the highest number of podcasts (661,552), followed by Business & Finance (435,586) and Comedy (413,024). Conversely, the categories with the fewest podcasts are Government (15,483), Others (25,906), and Technology (47,808).
- Podcast Counts by Rating: The majority of podcasts received a rating of 5 (3.98 million), indicating the highest level of satisfaction, while the fewest were rated as 2 (94.62k) on a scale of 1-5.
- Reviews Added Over Time: The highest number of reviews was recorded on May 10, 2021 (exceeding 1.2 million), followed by July 3, 2022 (559,523).
total_rows = len(processed_dataset)
sampling_percentage = 0.1 # 10%
sample_size = int(total_rows * sampling_percentage)
sampled_data = processed_dataset.sample(n=sample_size, random_state=42)
display(sampled_data.head(2))
|   | podcast_id | category | itunes_id | slug | itunes_url | title_x | title_y | content | rating | author_id | created_at | podcast_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 952553 | b50cbb051c6f5b6bb196a12b7a4dc740 | Education | 1.115025e+09 | investing-in-real-estate-clayton-morris-build-... | https://podcasts.apple.com/us/podcast/investin... | Investing in Real Estate with Clayton Morris |... | This podcast is the reason I own rentals | I started listening to this podcast in May of ... | 5 | 7BA3C402329DC10 | 2019-08-02T18:13:11-07:00 | Investing in Real Estate with Clayton Morris |... |
| 1952455 | c7c948797b0f044fa5e7bb17068fe734 | Kids & Family | 1.148570e+09 | your-parenting-mojo-respectful-research-based-... | https://podcasts.apple.com/us/podcast/your-par... | Your Parenting Mojo - Respectful, research-bas... | I love this show! | I basically listen right when new epiodes come... | 5 | 44C9191892B0C4F | 2017-07-03T15:26:12-07:00 | Your Parenting Mojo - Respectful, research-bas... |
# Database Preparation for Export to Looker
# max_file_size_mb = 99
# sampling_percentage = 0.1
# sampling_percentage_step = 0.005
# export_data = None
# file_size_mb = float('inf')
# while sampling_percentage > 0:
# export_sample_size = int(len(processed_dataset) * sampling_percentage)
# sampled_data = processed_dataset.sample(n=export_sample_size, random_state=42)
# sampled_data.to_csv('./datasets/export_data.csv', index=False)  # write to the same path checked below
# file_size_mb = os.path.getsize('./datasets/export_data.csv') / (1024 * 1024)
# if file_size_mb <= max_file_size_mb:
# export_data = sampled_data
# print(f"Sampled data exported with a file size of {file_size_mb:.2f} MB.")
# break
# sampling_percentage -= sampling_percentage_step
# if sampling_percentage <= 0:
# print("Could not achieve the desired file size within the given constraints.")
# print(f"Final sampling percentage used: {sampling_percentage:.3f}")
most_rated_query_df = (
sampled_data.groupby(["podcast_title"])
.agg({"rating": ["count", "mean"]})
.reset_index()
)
most_rated_query_df.columns = ["podcast_title", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
by="rating_count", ascending=False
).head(10)
most_rated_query_df_count = most_rated_query_df.sort_values(
by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
by="avg_rating", ascending=False
)
fig1 = px.bar(
most_rated_query_df_count,
x="podcast_title",
y="rating_count",
title="Top 10 Podcasts by Review Frequency",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Title")
fig1.update_yaxes(title="Number of Reviews Received")
fig2 = px.bar(
best_rated_query_df_avg,
x="podcast_title",
y="avg_rating",
title="Top 10 Podcasts by Average Rating",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Title")
fig2.update_yaxes(title="Average Rating")
fig1.show()
fig2.show()
most_rated_query_df = (
sampled_data.groupby(["category"]).agg({"rating": ["count", "mean"]}).reset_index()
)
most_rated_query_df.columns = ["category", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
by="rating_count", ascending=False
).head(10)
most_rated_query_df_count = most_rated_query_df.sort_values(
by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
by="avg_rating", ascending=False
)
fig1 = px.bar(
most_rated_query_df_count,
x="category",
y="rating_count",
title="Top 10 Categories by Review Frequency",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Category")
fig1.update_yaxes(title="Number of Reviews Received")
fig2 = px.bar(
best_rated_query_df_avg,
x="category",
y="avg_rating",
title="Top 10 Categories by Average Rating",
template="plotly_dark",
hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Category")
fig2.update_yaxes(title="Average Rating")
fig1.show()
fig2.show()
sampled_data["sentiment"] = sampled_data["content"].apply(analyze_sentiment)
fig = px.histogram(
sampled_data,
x="sentiment",
nbins=30,
title="Sentiment Distribution of Podcast Reviews",
)
fig.update_layout(
xaxis_title="Sentiment Polarity",
yaxis_title="Frequency",
bargap=0.05,
template="plotly_dark",
)
fig.show()
OUTCOMES:
- The most rated podcasts were "Crime Junkie Obsessed" with 48 ratings, "Wow in the World Love it" with 45 ratings, and "Wow in the World Awesome" with 41 ratings. These podcasts had respective average ratings of 5.00, 4.98, and 4.93.
- The podcasts with the highest average ratings were "Crime Junkie Obsessed," "Daebak Show w/ Eric Nam Amazing," "Crime Junkie Amazing," and "Crime Junkie Love it," all achieving a perfect average rating of 5.0.
- The most rated podcast category was "Society & Culture," which amassed over 66 thousand ratings out of the roughly 455 thousand sampled ratings, accounting for approximately 14.5% of the sample. Other highly rated categories include Business & Finance, with over 43 thousand ratings (about 9.6%), and Comedy, with more than 41 thousand ratings (about 9.1%).
- Podcasts in the Business & Finance category achieved the highest average ratings, with an impressive average of 4.85. Following closely are podcasts in the Religion & Spirituality category and Education category, both with an average rating of 4.83.
- The sentiment analysis indicates a predominantly positive sentiment in podcast reviews, with the majority of ratings falling within the positive range.
INSIGHTS:
- The most rated podcasts have accumulated more than 40 ratings each, with highly favorable average ratings ranging between 5.00 and 4.93.
- Podcasts consistently rated with an average of 5.0 not only attract a substantial audience but also consistently deliver content that resonates exceptionally well, resulting in consistently high average ratings across episodes.
- While "Society & Culture" may have the most rated podcasts, along with Business and Finance, and Comedy attracting a significant number of ratings, it's noteworthy that podcasts in the Business & Finance category receive the highest average ratings. This suggests that listeners highly appreciate the quality and value of content offered in this genre. Similarly, Religion & Spirituality and Education categories also boast exceptionally high average ratings, indicating strong listener satisfaction in these areas.
- The distribution of sentiment analysis highlights a predominantly positive sentiment in podcast reviews, indicating high overall satisfaction among listeners.
Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.
Significance Levels: The chosen significance level for hypothesis testing is α = 0.05.
Podcast Ratings vs. Voter Count
Confidence Intervals:
The 95% confidence interval for the count of ratings is [-15.97, 3480.77]. This interval provides a range of plausible values for the true population parameter, the count of ratings for podcasts in the dataset, at the 95% confidence level. The lower bound (-15.97) represents the lower estimate of the count of ratings, and the upper bound (3480.77) the upper estimate; we can be 95% confident that the true count of ratings falls within this interval.
sample_size = sampled_data.groupby("podcast_id")["rating"].count()
confidence_level = 0.95
t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
standard_error = np.sqrt(sample_size)
small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0
margin_of_error = t_value * standard_error
confidence_interval_counts = (
sample_size - margin_of_error,
sample_size + margin_of_error,
)
print("Confidence Interval for Count of Ratings:")
display(confidence_interval_counts[:5])
lower_bounds = confidence_interval_counts[0]
upper_bounds = confidence_interval_counts[1]
overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)
print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Count of Ratings:
(podcast_id
a00018b54eb342567c94dacfb2a3e504 1.000000
a00071f9aaae9ac725c3a586701abf4d -15.969287
a000aa69852b276565c4f5eb9cdd999b -1.208320
a0010b283ba17d282c7bb1f9709f0ac3 1.000000
a0013c50c1e6b24266fdeb10eed6eea7 1.000000
...
fffeb7d6d05f2b4c600fbebc828ca656 5.916641
ffff09ad9a175a57b1bbbdb3c1581ec0 1.000000
ffff1a7b221753187b1562bf638010fa 1.000000
ffff5db4b5db2d860c49749e5de8a36d -4.452413
ffff66f98c1adfc8d0d6c41bb8facfd0 1.000000
Name: rating, Length: 53240, dtype: float64,
podcast_id
a00018b54eb342567c94dacfb2a3e504 1.000000
a00071f9aaae9ac725c3a586701abf4d 19.969287
a000aa69852b276565c4f5eb9cdd999b 11.208320
a0010b283ba17d282c7bb1f9709f0ac3 1.000000
a0013c50c1e6b24266fdeb10eed6eea7 1.000000
...
fffeb7d6d05f2b4c600fbebc828ca656 22.083359
ffff09ad9a175a57b1bbbdb3c1581ec0 1.000000
ffff1a7b221753187b1562bf638010fa 1.000000
ffff5db4b5db2d860c49749e5de8a36d 10.452413
ffff66f98c1adfc8d0d6c41bb8facfd0 1.000000
Name: rating, Length: 53240, dtype: float64)
Overall Lower Bound: -15.969287064551526
Overall Upper Bound: 3480.769498138696
Statistical Hypotheses:
Null Hypothesis (H0): There is no relationship between podcasts' average ratings and the number of people voting for them.
Alternative Hypothesis (H1): There is a relationship between podcasts' average ratings and the number of people voting for them.
Hypothesis Testing:
To test the hypothesis regarding the relationship between podcasts' average ratings and the number of people voting for them, a chi-square test was conducted. The test statistic obtained was 14814.32, and the corresponding p-value was 0.0. Comparing the p-value to the chosen significance level α = 0.05, we reject the null hypothesis. This indicates that there is sufficient evidence of a relationship between average ratings and the number of people voting for podcasts.
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])
chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)
alpha = 0.05
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_val)
print("Degrees of Freedom:", dof)
if p_val < alpha:
print(
"Reject the null hypothesis. There is a significant association between ratings and podcast categories."
)
else:
print(
"Fail to reject the null hypothesis. There is no significant association between ratings and podcast categories."
)
print("Expected Frequencies Table:")
print(expected[:5])
Chi-square Statistic: 14814.31676859763
P-value: 0.0
Degrees of Freedom: 76
Reject the null hypothesis. There is a significant association between ratings and podcast categories.
Expected Frequencies Table:
[[ 1314.57588545   528.51554966   588.06895143   729.23669706 22199.6029164 ]
 [ 2236.23042975   899.05996894  1000.36650491  1240.50753593 37763.83556047]
 [ 2134.83806256   858.29591471   955.00913626  1184.26199258 36051.59489389]
 [ 1860.36332622   747.94536915   832.22423493  1032.00220114 31416.46486856]
 [  334.55334246   134.50470653   149.66076548   185.58728876  5649.69389678]]
Category Rating Differences
Confidence Intervals:
The 95% confidence interval for the mean rating is [-22.41, 28.41]. This interval provides a range of plausible values for the true population parameter, the mean rating of podcasts in the dataset, at the 95% confidence level. The lower bound (-22.41) represents the lower estimate of the mean rating, and the upper bound (28.41) the upper estimate; we can be 95% confident that the true mean rating falls within this interval.
mean_ratings = sampled_data.groupby("podcast_id")["rating"].mean()
sample_size = sampled_data.groupby("podcast_id")["rating"].count()
sample_std = sampled_data.groupby("podcast_id")["rating"].std()
confidence_level = 0.95
t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
# Mask degenerate groups (zero variance or fewer than 2 reviews), then fill once
zero_std_mask = sample_std == 0
small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0
sample_std[zero_std_mask] = np.nan
sample_std[small_sample_mask] = np.nan
sample_std_filled = sample_std.fillna(0)
margin_of_error = t_value * sample_std_filled / np.sqrt(sample_size)
confidence_interval_means = (
mean_ratings - margin_of_error,
mean_ratings + margin_of_error,
)
print("Confidence Interval for Mean Rating:")
display(confidence_interval_means[:5])
lower_bounds = confidence_interval_means[0]
upper_bounds = confidence_interval_means[1]
overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)
print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Mean Rating:
(podcast_id
a00018b54eb342567c94dacfb2a3e504 5.000000
a00071f9aaae9ac725c3a586701abf4d 5.000000
a000aa69852b276565c4f5eb9cdd999b 5.000000
a0010b283ba17d282c7bb1f9709f0ac3 5.000000
a0013c50c1e6b24266fdeb10eed6eea7 5.000000
...
fffeb7d6d05f2b4c600fbebc828ca656 3.159423
ffff09ad9a175a57b1bbbdb3c1581ec0 5.000000
ffff1a7b221753187b1562bf638010fa 1.000000
ffff5db4b5db2d860c49749e5de8a36d 3.232449
ffff66f98c1adfc8d0d6c41bb8facfd0 5.000000
Name: rating, Length: 53240, dtype: float64,
podcast_id
a00018b54eb342567c94dacfb2a3e504 5.000000
a00071f9aaae9ac725c3a586701abf4d 5.000000
a000aa69852b276565c4f5eb9cdd999b 5.000000
a0010b283ba17d282c7bb1f9709f0ac3 5.000000
a0013c50c1e6b24266fdeb10eed6eea7 5.000000
...
fffeb7d6d05f2b4c600fbebc828ca656 5.126291
ffff09ad9a175a57b1bbbdb3c1581ec0 5.000000
ffff1a7b221753187b1562bf638010fa 1.000000
ffff5db4b5db2d860c49749e5de8a36d 6.100884
ffff66f98c1adfc8d0d6c41bb8facfd0 5.000000
Name: rating, Length: 53240, dtype: float64)
Overall Lower Bound: -22.412409472864187
Overall Upper Bound: 28.412409472864187
Statistical Hypotheses:
Null Hypothesis (H0): There are no significant differences in rating averages among categories.
Alternative Hypothesis (H1): There are significant differences in rating averages among categories.
Hypothesis Testing:
To test the hypothesis regarding differences in average ratings between podcast categories, Tukey's Honestly Significant Difference (HSD) test following a one-way ANOVA was conducted. The F-statistic obtained was 703.94, and the corresponding p-value was 0.0. Comparing the p-value to the chosen significance level α = 0.05, we reject the null hypothesis. This indicates that there is sufficient evidence to conclude that average ratings differ between podcast categories.
category_groups = sampled_data.groupby("category")["rating"]
f_statistic, p_value = stats.f_oneway(*[group for name, group in category_groups])
alpha = 0.05
print("F-statistic:", f_statistic)
print("P-value:", p_value)
if p_value < alpha:
print(
"One-way ANOVA: There are significant differences in rating averages among categories."
)
tukey_results = pairwise_tukeyhsd(sampled_data["rating"], sampled_data["category"])
print(tukey_results.summary())
else:
print(
"One-way ANOVA: No significant differences in rating averages among categories were found."
)
F-statistic: 703.9440772581569
P-value: 0.0
One-way ANOVA: There are significant differences in rating averages among categories.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
======================================================================================
group1 group2 meandiff p-adj lower upper reject
--------------------------------------------------------------------------------------
Arts Business & Finance 0.1126 0.0 0.0847 0.1406 True
Arts Comedy -0.1056 0.0 -0.1338 -0.0775 True
Arts Education 0.0943 0.0 0.0654 0.1232 True
Arts Fiction -0.1184 0.0 -0.1676 -0.0692 True
Arts Government -0.2996 0.0 -0.3908 -0.2084 True
Arts Health & Fitness 0.053 0.0 0.0237 0.0822 True
Arts History -0.2243 0.0 -0.275 -0.1736 True
Arts Kids & Family 0.0135 0.9985 -0.0215 0.0484 False
Arts Leisure 0.0215 0.7694 -0.0123 0.0553 False
Arts Music 0.0437 0.045 0.0004 0.087 True
Arts News & Politics -0.4029 0.0 -0.4323 -0.3735 True
Arts Others -0.0838 0.0089 -0.1576 -0.01 True
Arts Religion & Spirituality 0.0947 0.0 0.0643 0.1251 True
Arts Science -0.2021 0.0 -0.248 -0.1562 True
Arts Society & Culture -0.163 0.0 -0.1891 -0.137 True
Arts Sports & Recreation -0.1035 0.0 -0.1332 -0.0737 True
Arts TV & Film -0.1885 0.0 -0.2198 -0.1572 True
Arts Technology -0.1703 0.0 -0.2256 -0.115 True
Arts True Crime -0.5634 0.0 -0.5988 -0.528 True
Business & Finance Comedy -0.2183 0.0 -0.2426 -0.194 True
Business & Finance Education -0.0183 0.5222 -0.0435 0.0069 False
Business & Finance Fiction -0.2311 0.0 -0.2781 -0.184 True
Business & Finance Government -0.4123 0.0 -0.5024 -0.3221 True
Business & Finance Health & Fitness -0.0597 0.0 -0.0853 -0.0341 True
Business & Finance History -0.337 0.0 -0.3857 -0.2883 True
Business & Finance Kids & Family -0.0992 0.0 -0.1311 -0.0673 True
Business & Finance Leisure -0.0912 0.0 -0.1219 -0.0605 True
Business & Finance Music -0.0689 0.0 -0.1099 -0.028 True
Business & Finance News & Politics -0.5155 0.0 -0.5413 -0.4898 True
Business & Finance Others -0.1964 0.0 -0.2689 -0.124 True
Business & Finance Religion & Spirituality -0.0179 0.6849 -0.0448 0.0089 False
Business & Finance Science -0.3148 0.0 -0.3584 -0.2711 True
Business & Finance Society & Culture -0.2757 0.0 -0.2975 -0.2539 True
Business & Finance Sports & Recreation -0.2161 0.0 -0.2423 -0.19 True
Business & Finance TV & Film -0.3011 0.0 -0.329 -0.2733 True
Business & Finance Technology -0.2829 0.0 -0.3364 -0.2295 True
Business & Finance True Crime -0.6761 0.0 -0.7085 -0.6436 True
Comedy Education 0.2 0.0 0.1745 0.2254 True
Comedy Fiction -0.0128 1.0 -0.06 0.0344 False
Comedy Government -0.194 0.0 -0.2842 -0.1038 True
Comedy Health & Fitness 0.1586 0.0 0.1327 0.1845 True
Comedy History -0.1187 0.0 -0.1675 -0.0699 True
Comedy Kids & Family 0.1191 0.0 0.087 0.1512 True
Comedy Leisure 0.1271 0.0 0.0962 0.158 True
Comedy Music 0.1493 0.0 0.1083 0.1904 True
Comedy News & Politics -0.2972 0.0 -0.3232 -0.2712 True
Comedy Others 0.0219 1.0 -0.0507 0.0944 False
Comedy Religion & Spirituality 0.2003 0.0 0.1733 0.2274 True
Comedy Science -0.0965 0.0 -0.1403 -0.0527 True
Comedy Society & Culture -0.0574 0.0 -0.0795 -0.0353 True
Comedy Sports & Recreation 0.0021 1.0 -0.0242 0.0285 False
Comedy TV & Film -0.0829 0.0 -0.111 -0.0548 True
Comedy Technology -0.0646 0.0031 -0.1182 -0.0111 True
Comedy True Crime -0.4578 0.0 -0.4904 -0.4251 True
Education Fiction -0.2127 0.0 -0.2604 -0.165 True
Education Government -0.3939 0.0 -0.4844 -0.3035 True
Education Health & Fitness -0.0414 0.0 -0.0681 -0.0146 True
Education History -0.3186 0.0 -0.3679 -0.2694 True
Education Kids & Family -0.0809 0.0 -0.1137 -0.0481 True
Education Leisure -0.0729 0.0 -0.1045 -0.0413 True
Education Music -0.0506 0.0027 -0.0922 -0.009 True
Education News & Politics -0.4972 0.0 -0.524 -0.4703 True
Education Others -0.1781 0.0 -0.2509 -0.1053 True
Education Religion & Spirituality 0.0004 1.0 -0.0275 0.0283 False
Education Science -0.2964 0.0 -0.3408 -0.2521 True
Education Society & Culture -0.2573 0.0 -0.2805 -0.2342 True
Education Sports & Recreation -0.1978 0.0 -0.225 -0.1706 True
Education TV & Film -0.2828 0.0 -0.3117 -0.2539 True
Education Technology -0.2646 0.0 -0.3186 -0.2106 True
Education True Crime -0.6577 0.0 -0.6911 -0.6244 True
Fiction Government -0.1812 0.0 -0.28 -0.0824 True
Fiction Health & Fitness 0.1714 0.0 0.1235 0.2193 True
Fiction History -0.1059 0.0 -0.1692 -0.0426 True
Fiction Kids & Family 0.1319 0.0 0.0803 0.1834 True
Fiction Leisure 0.1399 0.0 0.0891 0.1906 True
Fiction Music 0.1621 0.0 0.1046 0.2197 True
Fiction News & Politics -0.2845 0.0 -0.3324 -0.2365 True
Fiction Others 0.0346 0.9959 -0.0483 0.1176 False
Fiction Religion & Spirituality 0.2131 0.0 0.1646 0.2617 True
Fiction Science -0.0837 0.0001 -0.1432 -0.0242 True
Fiction Society & Culture -0.0446 0.0701 -0.0906 0.0014 False
Fiction Sports & Recreation 0.0149 0.9999 -0.0333 0.0631 False
Fiction TV & Film -0.0701 0.0001 -0.1192 -0.0209 True
Fiction Technology -0.0519 0.3971 -0.1189 0.0152 False
Fiction True Crime -0.445 0.0 -0.4969 -0.3931 True
Government Health & Fitness 0.3526 0.0 0.262 0.4431 True
Government History 0.0753 0.4438 -0.0243 0.1748 False
Government Kids & Family 0.3131 0.0 0.2205 0.4056 True
Government Leisure 0.3211 0.0 0.2289 0.4132 True
Government Music 0.3433 0.0 0.2473 0.4393 True
Government News & Politics -0.1033 0.0083 -0.1939 -0.0127 True
Government Others 0.2158 0.0 0.1027 0.3289 True
Government Religion & Spirituality 0.3943 0.0 0.3034 0.4852 True
Government Science 0.0975 0.0484 0.0003 0.1947 True
Government Society & Culture 0.1366 0.0 0.047 0.2261 True
Government Sports & Recreation 0.1961 0.0 0.1054 0.2868 True
Government TV & Film 0.1111 0.0026 0.0199 0.2023 True
Government Technology 0.1293 0.0012 0.0274 0.2313 True
Government True Crime -0.2638 0.0 -0.3565 -0.1711 True
Health & Fitness History -0.2773 0.0 -0.3268 -0.2278 True
Health & Fitness Kids & Family -0.0395 0.0038 -0.0726 -0.0064 True
Health & Fitness Leisure -0.0315 0.0583 -0.0634 0.0004 False
Health & Fitness Music -0.0092 1.0 -0.0511 0.0326 False
Health & Fitness News & Politics -0.4558 0.0 -0.4831 -0.4286 True
Health & Fitness Others -0.1367 0.0 -0.2097 -0.0638 True
Health & Fitness Religion & Spirituality 0.0418 0.0 0.0135 0.07 True
Health & Fitness Science -0.2551 0.0 -0.2996 -0.2105 True
Health & Fitness Society & Culture -0.216 0.0 -0.2395 -0.1924 True
Health & Fitness Sports & Recreation -0.1564 0.0 -0.184 -0.1288 True
Health & Fitness TV & Film -0.2414 0.0 -0.2707 -0.2122 True
Health & Fitness Technology -0.2232 0.0 -0.2774 -0.1691 True
Health & Fitness True Crime -0.6164 0.0 -0.65 -0.5827 True
History Kids & Family 0.2378 0.0 0.1848 0.2908 True
History Leisure 0.2458 0.0 0.1935 0.2981 True
History Music 0.268 0.0 0.2092 0.3269 True
History News & Politics -0.1785 0.0 -0.2281 -0.129 True
History Others 0.1405 0.0 0.0567 0.2244 True
History Religion & Spirituality 0.319 0.0 0.2689 0.3692 True
History Science 0.0222 0.9993 -0.0386 0.083 False
History Society & Culture 0.0613 0.0009 0.0137 0.1089 True
History Sports & Recreation 0.1208 0.0 0.0711 0.1706 True
History TV & Film 0.0358 0.5793 -0.0148 0.0865 False
History Technology 0.0541 0.3489 -0.0141 0.1222 False
History True Crime -0.3391 0.0 -0.3924 -0.2857 True
Kids & Family Leisure 0.008 1.0 -0.0292 0.0452 False
Kids & Family Music 0.0303 0.7125 -0.0157 0.0762 False
Kids & Family News & Politics -0.4163 0.0 -0.4495 -0.3831 True
Kids & Family Others -0.0972 0.0008 -0.1726 -0.0218 True
Kids & Family Religion & Spirituality 0.0813 0.0 0.0472 0.1153 True
Kids & Family Science -0.2156 0.0 -0.264 -0.1672 True
Kids & Family Society & Culture -0.1765 0.0 -0.2067 -0.1462 True
Kids & Family Sports & Recreation -0.1169 0.0 -0.1505 -0.0834 True
Kids & Family TV & Film -0.2019 0.0 -0.2368 -0.1671 True
Kids & Family Technology -0.1837 0.0 -0.2411 -0.1263 True
Kids & Family True Crime -0.5769 0.0 -0.6155 -0.5382 True
Leisure Music 0.0223 0.9726 -0.0229 0.0674 False
Leisure News & Politics -0.4243 0.0 -0.4564 -0.3923 True
Leisure Others -0.1052 0.0001 -0.1801 -0.0303 True
Leisure Religion & Spirituality 0.0733 0.0 0.0403 0.1062 True
Leisure Science -0.2236 0.0 -0.2712 -0.176 True
Leisure Society & Culture -0.1845 0.0 -0.2135 -0.1555 True
Leisure Sports & Recreation -0.1249 0.0 -0.1573 -0.0926 True
Leisure TV & Film -0.2099 0.0 -0.2437 -0.1762 True
Leisure Technology -0.1917 0.0 -0.2485 -0.135 True
Leisure True Crime -0.5849 0.0 -0.6225 -0.5472 True
Music News & Politics -0.4466 0.0 -0.4885 -0.4046 True
Music Others -0.1275 0.0 -0.2071 -0.0478 True
Music Religion & Spirituality 0.051 0.0037 0.0084 0.0936 True
Music Science -0.2458 0.0 -0.3006 -0.191 True
Music Society & Culture -0.2067 0.0 -0.2464 -0.1671 True
Music Sports & Recreation -0.1472 0.0 -0.1894 -0.105 True
Music TV & Film -0.2322 0.0 -0.2755 -0.1889 True
Music Technology -0.214 0.0 -0.2769 -0.1511 True
Music True Crime -0.6071 0.0 -0.6535 -0.5608 True
News & Politics Others 0.3191 0.0 0.2461 0.3921 True
News & Politics Religion & Spirituality 0.4976 0.0 0.4692 0.526 True
News & Politics Science 0.2007 0.0 0.1561 0.2454 True
News & Politics Society & Culture 0.2398 0.0 0.2161 0.2635 True
News & Politics Sports & Recreation 0.2994 0.0 0.2717 0.3271 True
News & Politics TV & Film 0.2144 0.0 0.185 0.2437 True
News & Politics Technology 0.2326 0.0 0.1784 0.2868 True
News & Politics True Crime -0.1606 0.0 -0.1943 -0.1268 True
Others Religion & Spirituality 0.1785 0.0 0.1051 0.2519 True
Others Science -0.1183 0.0 -0.1994 -0.0373 True
Others Society & Culture -0.0792 0.0134 -0.151 -0.0075 True
Others Sports & Recreation -0.0197 1.0 -0.0929 0.0534 False
Others TV & Film -0.1047 0.0001 -0.1785 -0.0309 True
Others Technology -0.0865 0.0517 -0.1732 0.0003 False
Others True Crime -0.4796 0.0 -0.5553 -0.404 True
Religion & Spirituality Science -0.2968 0.0 -0.3421 -0.2516 True
Religion & Spirituality Society & Culture -0.2577 0.0 -0.2826 -0.2328 True
Religion & Spirituality Sports & Recreation -0.1982 0.0 -0.2269 -0.1695 True
Religion & Spirituality TV & Film -0.2832 0.0 -0.3135 -0.2529 True
Religion & Spirituality Technology -0.265 0.0 -0.3197 -0.2102 True
Religion & Spirituality True Crime -0.6581 0.0 -0.6927 -0.6235 True
Science Society & Culture 0.0391 0.1175 -0.0034 0.0816 False
Science Sports & Recreation 0.0986 0.0 0.0538 0.1435 True
Science TV & Film 0.0136 1.0 -0.0322 0.0595 False
Science Technology 0.0319 0.9729 -0.0328 0.0965 False
Science True Crime -0.3613 0.0 -0.4101 -0.3125 True
Society & Culture Sports & Recreation 0.0595 0.0 0.0354 0.0837 True
Society & Culture TV & Film -0.0255 0.0626 -0.0515 0.0005 False
Society & Culture Technology -0.0073 1.0 -0.0597 0.0452 False
Society & Culture True Crime -0.4004 0.0 -0.4313 -0.3695 True
Sports & Recreation TV & Film -0.085 0.0 -0.1147 -0.0553 True
Sports & Recreation Technology -0.0668 0.0023 -0.1212 -0.0124 True
Sports & Recreation True Crime -0.4599 0.0 -0.494 -0.4259 True
TV & Film Technology 0.0182 0.9998 -0.0371 0.0735 False
TV & Film True Crime -0.3749 0.0 -0.4103 -0.3395 True
Technology True Crime -0.3931 0.0 -0.4509 -0.3354 True
--------------------------------------------------------------------------------------
Temporal Analysis
Confidence Intervals:
The 95% confidence interval for the mean rating score is [4.631971527166266, 4.646939640938188]. This interval gives a range of plausible values for the true population mean rating across all podcasts in the dataset: we can be 95% confident that the true mean rating score falls within it.
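A t-based interval like the one quoted above can be computed directly from the ratings column (a sketch; `sampled_data["rating"]` is assumed from earlier cells):

```python
import numpy as np
from scipy import stats

def mean_rating_ci(ratings, confidence=0.95):
    """Return (lower, upper) t-based confidence interval for the mean."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.size
    mean = ratings.mean()
    sem = ratings.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    t_score = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return mean - t_score * sem, mean + t_score * sem

# lower, upper = mean_rating_ci(sampled_data["rating"])
```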
from scipy.stats import t  # needed for the t critical value

sampled_data["created_at"] = pd.to_datetime(sampled_data["created_at"])
sampled_data["month"] = sampled_data["created_at"].dt.month
monthly_rating = sampled_data.groupby("month")["rating"].mean()

# One pooled standard error of the twelve monthly means, applied to every month
mean_rating = monthly_rating.mean()
std_rating = monthly_rating.std()
std_error = std_rating / len(monthly_rating) ** 0.5

# T-score for a 95% confidence level
t_score = t.ppf(0.975, df=len(monthly_rating) - 1)

confidence_intervals = []
for month, rating in monthly_rating.items():
    lower_bound = rating - t_score * std_error
    upper_bound = rating + t_score * std_error
    confidence_intervals.append((month, lower_bound, upper_bound))

print("Confidence intervals for mean rating by each month:")
for month, lower, upper in confidence_intervals:
    print(f"Month {month}: ({lower}, {upper})")
Confidence intervals for mean rating by each month:
Month 1: (4.640083007927041, 4.655051121698962)
Month 2: (4.667997531372459, 4.68296564514438)
Month 3: (4.6466592300946035, 4.661627343866525)
Month 4: (4.66141280995477, 4.676380923726692)
Month 5: (4.658119006292087, 4.673087120064008)
Month 6: (4.631688997949905, 4.646657111721827)
Month 7: (4.648462691048918, 4.663430804820839)
Month 8: (4.656061846472697, 4.671029960244618)
Month 9: (4.645996759562044, 4.660964873333965)
Month 10: (4.6458571438811145, 4.660825257653036)
Month 11: (4.634314514342503, 4.649282628114424)
Month 12: (4.631971527166266, 4.646939640938188)
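The cell above applies a single pooled standard error to every month. An alternative that bases each interval on that month's own sample size and spread would look like this (a sketch, assuming the same `sampled_data` frame with `month` and `rating` columns):

```python
import numpy as np
from scipy.stats import t

def per_month_cis(df, confidence=0.95):
    """Return {month: (lower, upper)} using each month's own standard error."""
    out = {}
    for month, group in df.groupby("month")["rating"]:
        n = len(group)
        sem = group.std(ddof=1) / np.sqrt(n)  # this month's standard error
        t_score = t.ppf(0.5 + confidence / 2, df=n - 1)
        out[month] = (group.mean() - t_score * sem, group.mean() + t_score * sem)
    return out

# cis = per_month_cis(sampled_data)
```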
Statistical Hypotheses:
Null Hypothesis (H0): No specific time periods are associated with higher or lower ratings; the mean rating is the same across months.
Alternative Hypothesis (H1): Specific time periods are associated with higher or lower ratings; at least one month's mean rating differs.
Hypothesis Testing:
To test the hypothesis about the relationship between review time periods and ratings, a one-way ANOVA was conducted. The test statistic was 5.048049115266022, and the corresponding p-value was approximately 0.0. At a significance level of α = 0.05, the p-value falls below the threshold, so we reject the null hypothesis. There is sufficient evidence to conclude that specific time periods are associated with higher or lower ratings.
from scipy.stats import f_oneway  # needed for the one-way ANOVA

month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]
f_statistic, p_value = f_oneway(*month_groups)
print(f"P-value is: {p_value}, and test statistic is: {f_statistic}")

if p_value < 0.05:
    print("There are significant differences in ratings among different months.")
else:
    print("No significant differences in ratings among different months were found.")
P-value is: 0.0, and test statistic is: 5.048049115266022
There are significant differences in ratings among different months.
from statsmodels.stats.multicomp import pairwise_tukeyhsd  # post-hoc pairwise test

tukey_results = pairwise_tukeyhsd(sampled_data["rating"], sampled_data["month"])
print(tukey_results.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------
1 2 0.0279 0.0058 0.0044 0.0514 True
1 3 0.0066 0.9991 -0.0171 0.0303 False
1 4 0.0213 0.1273 -0.0024 0.045 False
1 5 0.018 0.3398 -0.0055 0.0416 False
1 6 -0.0084 0.9916 -0.032 0.0152 False
1 7 0.0084 0.9918 -0.0152 0.032 False
1 8 0.016 0.5224 -0.0074 0.0393 False
1 9 0.0059 0.9996 -0.0172 0.0291 False
1 10 0.0058 0.9997 -0.0173 0.0289 False
1 11 -0.0058 0.9998 -0.0297 0.0181 False
1 12 -0.0081 0.9949 -0.0323 0.016 False
2 3 -0.0213 0.1426 -0.0454 0.0028 False
2 4 -0.0066 0.9992 -0.0307 0.0175 False
2 5 -0.0099 0.9729 -0.0339 0.0141 False
2 6 -0.0363 0.0 -0.0603 -0.0123 True
2 7 -0.0195 0.2469 -0.0435 0.0045 False
2 8 -0.0119 0.8933 -0.0357 0.0118 False
2 9 -0.022 0.0943 -0.0456 0.0016 False
2 10 -0.0221 0.0877 -0.0457 0.0014 False
2 11 -0.0337 0.0004 -0.058 -0.0094 True
2 12 -0.036 0.0001 -0.0606 -0.0115 True
3 4 0.0148 0.7053 -0.0096 0.0391 False
3 5 0.0115 0.9268 -0.0127 0.0356 False
3 6 -0.015 0.679 -0.0392 0.0092 False
3 7 0.0018 1.0 -0.0224 0.026 False
3 8 0.0094 0.9813 -0.0145 0.0334 False
3 9 -0.0007 1.0 -0.0244 0.0231 False
3 10 -0.0008 1.0 -0.0245 0.0229 False
3 11 -0.0123 0.8916 -0.0368 0.0122 False
3 12 -0.0147 0.7342 -0.0394 0.0101 False
4 5 -0.0033 1.0 -0.0275 0.0209 False
4 6 -0.0297 0.0035 -0.0539 -0.0055 True
4 7 -0.013 0.8464 -0.0372 0.0113 False
4 8 -0.0054 0.9999 -0.0293 0.0186 False
4 9 -0.0154 0.6101 -0.0392 0.0084 False
4 10 -0.0156 0.5929 -0.0393 0.0082 False
4 11 -0.0271 0.016 -0.0516 -0.0026 True
4 12 -0.0294 0.0058 -0.0542 -0.0047 True
5 6 -0.0264 0.0175 -0.0505 -0.0023 True
5 7 -0.0097 0.9781 -0.0337 0.0144 False
5 8 -0.0021 1.0 -0.0259 0.0218 False
5 9 -0.0121 0.8799 -0.0358 0.0115 False
5 10 -0.0123 0.8698 -0.0359 0.0113 False
5 11 -0.0238 0.0632 -0.0482 0.0006 False
5 12 -0.0261 0.0262 -0.0508 -0.0015 True
6 7 0.0168 0.4963 -0.0073 0.0409 False
6 8 0.0244 0.0401 0.0005 0.0482 True
6 9 0.0143 0.711 -0.0094 0.038 False
6 10 0.0142 0.7212 -0.0095 0.0378 False
6 11 0.0026 1.0 -0.0218 0.027 False
6 12 0.0003 1.0 -0.0244 0.0249 False
7 8 0.0076 0.9968 -0.0163 0.0315 False
7 9 -0.0025 1.0 -0.0262 0.0212 False
7 10 -0.0026 1.0 -0.0262 0.021 False
7 11 -0.0141 0.7634 -0.0386 0.0103 False
7 12 -0.0165 0.5605 -0.0412 0.0082 False
8 9 -0.0101 0.963 -0.0335 0.0134 False
8 10 -0.0102 0.9585 -0.0336 0.0132 False
8 11 -0.0217 0.1262 -0.0459 0.0024 False
8 12 -0.0241 0.0569 -0.0485 0.0003 False
9 10 -0.0001 1.0 -0.0233 0.023 False
9 11 -0.0117 0.9124 -0.0357 0.0123 False
9 12 -0.014 0.765 -0.0383 0.0102 False
10 11 -0.0115 0.9179 -0.0355 0.0124 False
10 12 -0.0139 0.7743 -0.0381 0.0103 False
11 12 -0.0023 1.0 -0.0273 0.0226 False
----------------------------------------------------
cnx.close()  # close the database connection opened earlier
OUTCOMES:
- To test the hypothesis about the association between podcast ratings and the number of people voting for them, a chi-square test was conducted. The test statistic was 14814.31676859763, and the corresponding p-value was approximately 0.0. At a significance level of α = 0.05, we reject the null hypothesis of independence.
- To test the hypothesis about differences in average ratings between podcast categories, a one-way ANOVA followed by Tukey's Honestly Significant Difference (HSD) test was conducted. The F-statistic was 703.9440772581569, and the corresponding p-value was approximately 0.0. At α = 0.05, we reject the null hypothesis.
- To test the hypothesis about differences in average ratings between months, a one-way ANOVA was conducted. The F-statistic was 5.048049115266022, and the corresponding p-value was approximately 0.0. At α = 0.05, we reject the null hypothesis.
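The chi-square computation itself is not shown in this section. Under the assumption that vote counts were binned and cross-tabulated against rating values, it could be sketched as follows (the `votes` column name and the bin edges are assumptions, not the notebook's actual code):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def rating_votes_chi2(df, bins=(0, 1, 10, 100, float("inf"))):
    """Chi-square test of independence between rating and binned vote counts."""
    vote_bin = pd.cut(df["votes"], bins=bins, include_lowest=True)
    contingency = pd.crosstab(df["rating"], vote_bin)
    contingency = contingency.loc[:, contingency.sum(axis=0) > 0]  # drop empty bins
    chi2, p, dof, _ = chi2_contingency(contingency)
    return chi2, p, dof

# chi2, p, dof = rating_votes_chi2(sampled_data)
```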
INSIGHTS:
- The results of the chi-square test indicate that there is sufficient evidence of an association between podcast ratings and the number of people voting for them.
- The results of the ANOVA and HSD tests indicate that there is sufficient evidence of differences in average ratings between podcast categories. Specifically, the Arts category received significantly higher ratings than other categories, while the Government, Health & Fitness, News & Politics, and True Crime categories received significantly lower ratings.
- February, April, and May tend to have higher average ratings, while June, November, and December tend to have lower averages. Tukey's test confirms that February's ratings differ significantly from those of June, November, and December, whereas most other month pairs show no significant differences.
¶
SUMMARY:
Difference in Podcast Ratings and Voting Counts:
- The chi-square test reveals a significant association between average ratings and the number of people voting for podcasts. This suggests that while some podcasts receive high ratings, their vote counts vary widely, indicating potential disparities in audience engagement and reach.
Variation in Average Ratings Across Podcast Categories:
- The HSD and ANOVA tests demonstrate notable differences in average ratings across different podcast categories. Categories such as Arts tend to receive higher ratings, whereas Government, Health & Fitness, News & Politics, and True Crime categories receive lower ratings on average. Understanding these variations can help in tailoring content and marketing strategies to better suit audience preferences within each category.
Seasonal Trends in Podcast Ratings:
- Monthly analysis reveals variations in average ratings across the year. February, April, and May exhibit higher average ratings, while June, November, and December tend to be lower. Further investigation into the reasons behind these seasonal patterns can inform content scheduling and promotion strategies.
¶
INSIGHTS:
- The significant difference between podcast ratings and voting counts suggests that while some podcasts may have high satisfaction levels among listeners, their reach and engagement might not align with their quality. This calls for strategies to enhance visibility and engagement for high-quality podcasts.
¶
POTENTIAL AREAS FOR INVESTIGATION:
Observations Relevant to Stakeholders:
- Explore factors influencing the observed differences in voting counts among podcasts with similar average ratings. This could involve analyzing promotion strategies, audience demographics, or platform visibility.
- Explore audience feedback and preferences to identify areas for content improvement. This could involve analyzing listener reviews, ratings, and engagement metrics to understand what resonates most with the audience and adjust content strategies accordingly.
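As one concrete starting point for the first question above, a rank correlation between ratings and vote counts would quantify how weakly the two move together for podcasts with similar ratings (a sketch; the `votes` column name is an assumption):

```python
from scipy.stats import spearmanr

def rating_votes_rank_corr(df):
    """Spearman rank correlation between ratings and vote counts.

    Returns (rho, p): rho near 0 would support the observed disparity
    between rating levels and audience reach.
    """
    rho, p = spearmanr(df["rating"], df["votes"])
    return rho, p

# rho, p = rating_votes_rank_corr(sampled_data)
```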
Observations Relevant to Analysts (would require additional data):
- Examine additional variables that may contribute to the observed variations in average ratings across podcast categories, such as content format, host expertise, or audience demographics.
- Further investigate the impact of promotional activities, release schedules, and episode lengths on podcast ratings to optimize content strategies for maximizing audience satisfaction and engagement.